A comparison of neural network feature transforms for speaker diarization
نویسندگان
چکیده
Speaker diarization finds contiguous speaker segments in an audio stream and clusters them by speaker identity, without using a-priori knowledge about the number of speakers or enrollment data. Diarization typically clusters speech segments based on short-term spectral features. In prior work, we showed that neural networks can serve as discriminative feature transformers for diarization by training them to perform same/different speaker comparisons on speech segments, yielding improved diarization accuracy when combined with standard MFCC-based models. In this work, we explore a wider range of neural network architectures for feature transformation, by adding additional layers and nonlinearities, and by varying the objective function during training. We find that the original speaker comparison network can be improved by adding a nonlinear transform layer, and that further gains are possible by training the network to perform speaker classification rather than comparison. Overall we achieve relative reductions in speaker error between 18% and 34% on a variety of test data from the AMI, ICSI, and NIST-RT corpora.
منابع مشابه
Speaker diarization of spontaneous meeting room conversations
Speaker diarization is the task of identifying “who spoke when” in an audio stream containing multiple speakers. This is an unsupervised task as there is no a priori information about the speakers. Diagnostical studies on state-of-the-art diarization systems have isolated three main issues with the systems; overlapping speech, effects of background noise and speech/nonspeech detection errors on...
متن کاملA Triplet Ranking-Based Neural Network for Speaker Diarization and Linking
This paper investigates a novel neural scoring method, based on conventional i-vectors, to perform speaker diarization and linking of large collections of recordings. Using triplet loss for training, the network projects i-vectors in a space that better separates speakers in terms of cosine similarity. Experiments are run on two French TV collections built from REPERE [1] and ETAPE [2] campaign...
متن کاملDetection and Handling of Overlapping Speech for Speaker Diarization
This thesis concerns the detection of overlapping speech segments and its further application for the improvement of speaker diarization performance. We propose the use of three spatial cross-correlationbased parameters for overlap detection on distant microphone channel data. Spatial features from di↵erent microphone pairs are fused by means of principal component analysis or by an approach in...
متن کاملشبکه عصبی پیچشی با پنجرههای قابل تطبیق برای بازشناسی گفتار
Although, speech recognition systems are widely used and their accuracies are continuously increased, there is a considerable performance gap between their accuracies and human recognition ability. This is partially due to high speaker variations in speech signal. Deep neural networks are among the best tools for acoustic modeling. Recently, using hybrid deep neural network and hidden Markov mo...
متن کاملSpeaker Diarization with LSTM
For many years, i-vector based audio embedding techniques were the dominant approach for speaker verification and speaker diarization applications. However, mirroring the rise of deep learning in various domains, neural network based audio embeddings, also known as d-vectors, have consistently demonstrated superior speaker verification performance. In this paper, we build on the success of dvec...
متن کامل